Faster exact Markovian probability functions for motif 
occurrences: a DFA-only approach 



Paolo Ribeca, Emanuele Raineri* 

Systems Biology Unit, Center for Genomic Regulation, C/Dr.Aiguader 88, E08003 Barcelona, Spain 

February 5, 2008 



Abstract 

Background: The computation of the statistical properties of motif occurrences has an 
obviously relevant practical application: for example, patterns that are significantly over- or 
under-represented in the genome are interesting candidates for biological roles. However, the 
problem is computationally hard; as a result, virtually all the existing pipelines (for instance 
[I]) use fast but approximate scoring functions, in spite of the fact that they have been 
shown to systematically produce incorrect results [2] [3]. A few interesting exact approaches 
are known [2] [J, but they are very slow and hence not practical in the case of realistic 
sequences. Results: We give an exact solution, solely based on deterministic finite-state 
automata (DFAs), to the problem of finding not only the p- value, but the whole relevant 
part of the Markovian probability distribution function of a motif in a biological sequence. In 
particular, the time complexity of the algorithm in the most interesting regimes is far better 
than that of [2], which was the fastest similar exact algorithm known to date; in many cases, 
even approximate methods are outperformed. Conclusions: DFAs are a standard tool of 
computer science for the study of patterns, but so far they have been sparingly used in the 
study of biological motifs. Previous works 2 5 do propose algorithms involving automata, 
but there they are used respectively as a first step to build a Finite Markov Chain Imbedding 
(FMCI), or to write a generating function: whereas we only rely on the concept of DFA 
to perform the calculations. This innovative approach can realistically be used for exact 
statistical studies of very long genomes and protein sequences, as we illustrate with some 
examples on the scale of the human genome. 



1 Introduction 



It is difficult for analysis tools to meet the challenge represented by the ever-increasing data 
flux coming from high-throughput post-genomics experiments. One sector where this problem is 
particularly evident is that of sequence analysis. 

As it is well known, for example, the detection of statistically relevant nucleotide sequences 
that occur repeatedly in a (possibly very long) stretch of genetic material has interesting biological 
aspects. Motifs appearing significantly more or less than mere chance would dictate can hint to the 
presence of relevant regulatory regions, e.g. promoters or tandem repeats. Statistical properties 
may relate to structural ones as well: recently, for example, a 117Kb periodicity in E.coli has been 
linked to topological features of its chromosomes [6]; the exceptional statistics of the 11 crossover 
hot spot instigator" close to the genome core common to different strains of E. coli can be possibly 
be linked to the protection of that part of the chromosome from recombination events [7] . 

However, it is not easy to pinpoint what statistically relevant exactly means in the context of 
sequence analysis. In particular, the problem is computationally hard for at least two reasons: 
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1. given a sequence of symbols taken from an alphabet A of size a, the number of possible 
sequences of length L is a L , that is, the number of possible strings grows exponentially with 
the length of the considered stretch. It should then not come as a surprise to the reader the 
fact that, at least in one approach found in the literature, the solution is given in terms of 
an NP-hard algorithm [4]. 

2. complicated motifs can overlap in non-trivial ways, making simple statistical approximations 
unreliable and requiring the exploitation of more sophisticated analytic techniques. As a 
matter of fact, it has been proved that in many common situations approximate methods 
do systematically fail to predict correct statistical estimators [2] [3]; thus, possessing a fast 
non-approximate method would seem essential to provide solid and unbiased foundations to 
characterize under- and over-represented motifs from the point of view of their biological 
role. 

In fact, a good deal of attention has been devoted in the recent past to the problem of detecting 
DNA motifs which appear with anomalous frequency inside a genome, and this problem has 
prompted the development of a diverse range of tools and algorithms for the study of the statistics 
of pattern occurrences [1][8 9 . However, to be computationally feasible most of the proposed 
methods involve either quite drastic approximations on the statistical model which is used to 
describe the genome (see for example [1]), or some additional information (e.g., a training set) to 
be supplied by the user (see for example [TO]). 

Although very slow in many realistic regimes and as a consequence probably unpractical for 
everyday use, a few algorithms to compute exact probability distribution functions for motif 
occurrences are known. We distinguish two main methods: the first one, based on position- 
weighted matrices, is presented in [4]; the second one, introduced in [2], takes advantage of Finite 
Markov Chain Imbeddings (FMCIs) using Deterministic Finite-state Automata (DFAs) to deduce 
the Markovian transition matrix of the model. 

In this paper we introduce a third exact method. The general statistical setup is similar to 
that of the latter method (we take as null hypothesis the fact that our sequence is generated by a 
Markov model of arbitrary order), but the proposed algorithm is very different. In particular, we 
show that: 

1. contrarily to what all the recent literature about exactly solved Markov models would seem 
to imply, FMCIs are completely inessential to evaluate the probability distribution functions 



2. a simpler approach entirely based on DFAs is possible. The simpler resulting algorithm 
naturally lends itself to a much better optimization, making the exact analysis of genomes 
of realistic size feasible. In fact, in many regimes the obtained performance is even better 
than that of the approximate large-deviations and Gaussian models of [3] (see Table [2]). 

As an application, we produce in Section [5] an analysis of various motifs in the human X chro- 
mosome (L ~ 1.5 ■ 10 s ), and of more than 16.000 transcription-factor binding sites in S.cerevisiae 
(L ~ 1.2 • 10 7 ). Both cases are out of reach for exact FMCI methods (see Table [2]). 

1.1 Background about exact Markovian methods 

Defining L as the length of the sequence under analysis, N \, s as the number of observed occurrences 
of the examined motif m, a as the number of symbols in the alphabet A, and s as the number 
of states of a DFA which is able to recognize and count the motif m (see Section l2~Tj) , the exact 
algorithm presented in [5] to compute the p-value of m for a Markov model of order m runs in a 
time proportional to 



It is not immediate to get a practical comprehension of the meaning of such an expression, so we 
briefly analyze it here. We can distinguish two main relevant asymptotic regimes: the short-pattern 
regime, and the long-pattern regime. 



of motifs. 




(1) 



2 



In the short-pattern regime, the cost of the algorithm turns out to be essentially quadratic in 
the length L of the considered sequence. In fact, in the case of m = and uniform probability 
distribution of symbols the typical recorded number of observed pattern occurrences may be 
estimated as 

N obs ~- e , (2) 

being £ the length of the pattern; similar results hold for more complicated Markov models. 
Inserting this estimate into {!]), one immediately realizes that in this regime the cost becomes 
proportional to L 2 . 

On the other hand, for long patterns the cost is linear in L: in fact — as suggested by ((2]) 
again — the typical occurrence numbers in this case are N ^ s = or iV bs = 1 , so the cost becomes 
essentially independent of iV bs and proportional to L. 

Of course, being the size of realistic biological sequences very large (for example, in the range 
10 7 -10 10 for the case of a typical genome) an algorithm which is linear, or — even worse — quadratic 
in L is essentially unpractical: both the long- and the short-pattern regimes will be unaccessiblc 
to it. 

1.2 Results and discussion 

In this paper we show how a formulation of motif counting in terms of systolic DFAs (see Sec- 
tioning]) allows us to directly deduce an algorithm with cost O (a x s x (iV bs + 1) x L), that is, 
equivalent in complexity to that presented in [2]. Furthermore, additional formal developments 
described in S ect ion [2 . 3 1 make it possible to write the probability distribution function p(x) of the 
occurrences of a motif m as 

p(x)=Tr((M(x)) L .v ), (3) 

where, as explained in Section [2.31 i>o is an M- valued vector of length s; in turn, A4(x) is an s x s 
sparse square matrix with a x s non-zero elements, each of its elements being a polynomial in x of 
degree ^ — L/s (see Section H2T3"]) . Of course, through their construction M. and vo depend 

paramctrically both on the order m of the Markov model and on the pattern m being examined; 
however, for the sake of notational simplicity we will not indicate this fact explicitly in the rest of 
the paper, as much as we will often drop the dependence of M. on x as well. 
There are two main possible evaluation strategies for such an expression: 

1. we compute p(x) as 

p(x)=M-(M-(...-(M-v )...)); 

we are able to take advantage of the sparsity of the matrix Ad, but the resulting algorithm 
has a final complexity of 

o(«xsx (AUs + 1) 2 x i) , 

being thus slower than the FMCI method in [2] due essentially to the quadratic cost of 
polynomial multiplication of matrix elements. 

2. we compute M. L directly by logarithmic decomposition, and its product with the initial 
condition vector vq in the end. The naive cost of such a scheme is now 

o(s 3 x(iV b s + l) 2 xlogL), 

which is much less than {TJ) in the linear long-pattern regime. 

In addition, in Section [3] we observe that it is possible to use the FFT algorithm to perform 
polynomial multiplication as a convolution of the coefficients; as a result, the whole bulk of 
p{x) may be obtained with a complexity of 

0(s 2 x logi x L) (4) 
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which is much faster than that of the algorithm of [5] in the quadratic short-pattern regime. 
In addition, defining pjx) as the truncated distribution obtained from p(x) by suppressing 
the tail regions as long as p{x) < emaxp, and introducing the quantity 



base e (p) := length(suppp E ) , 



(5) 



a successive refinement of the method allows to obtain for this technique an even better final 



which is excellent in the case of the evaluation of both the linear and the quadratic regime, 
given that base e (p) <C L in all practical cases (see Table [5]) . 

To illustrate the quality of the results just obtained we consider two examples from real-life situ- 
ations in the case of a Markov model of order 1: 

1. a very long sequence (i.e. the entire human genome, ~ 3 • 10 9 base pairs) and a long pattern 
of length 16, such that iV bs = 1- In this case, one can deduce from ([6]) that the complexity 
speedup obtained by our algorithm w.r.t. the FMCI-based one would be ~ 2 • 10 5 . 

2. an entire human chromosome (~ 1.5 • 10 8 base pairs) and a very short pattern, ATC for 
example; this is the nastiest possible case for the exact algorithm of [2], since it falls in its 
quadratic regime. Here our algorithm outperforms the FMCI one by a factor of ~ 5 • 10 6 . 

In fact, Equation ((SJ) tells us that the performance of our exact technique is quite close to that of 
the approximate Gaussian method described in [3] (which is essentially proportional to s x log L) ; 
maybe more surprisingly, in the most interesting regime of moderate to, long sequences and not too 
short patterns our method is in general even faster than both the large-deviations and the Gaussian 
approximations, and possibly much faster (see Table [2] below) . In addition, when thinking about 
these comparisons it should not be forgotten that — unlike approximate methods — our faster 
scheme is still able to produce the whole bulk of the probability distribution, from which all the 
interesting statistical quantities can be straightforwardly computed. A more complete comparison 
of the existing Markov methods to ours is presented in Section together with a discussion of the 
timings and the results obtained for some test examples. 

2 System and methods 

It is very easy to write a computer program which, given a stretch of DNA or protein of length L, 
finds all the contained sub-patterns of any length i together with the corresponding frequencies. 
As mentioned before, however, supposing that a given motif has been found -/V b s times it is not 
easy to define and compute what the expected occurrence probability p(N Q \>s) should be: i.e.. it 
is not easy to guess whether the measured number iV bs is a priori large or small. 

In fact, a satisfactory and very well-known conceptual framework to tackle similar problems has 
been formulated decades ago in the context of computer science, where the ability of recognizing 
("parsing") symbol strings in programs is essential; it is based on deterministic finite-state au- 
tomata (DFA for shortness). Such theoretical devices are ubiquitous and fundamental in computer 
science, so we will not describe their principles here; we just mention that the interested reader 
may easily find many thorough introductions to the field in standard computer-science literature 
— one classic reference for this subject being for example [TT] , 

What makes DFAs particularly interesting from the point of view of biological analysis is 
that they can naturally be linked to Markov models of sequences. In fact, we can formulate the 
Markovian null hypothesis that our stretch of DNA or protein is completely determined by its 
statistics of order to, i.e. by the frequency of appearance of each unique sub-string of length to + 1 
(with to ^ 0) contained in the sequence; in this case, if we apply to the string being examined a 
sliding window of length m + 1 , we can straightforwardly reinterpret the stretch as a sequence of 



cost of 




(6) 
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consecutive transitions between groups of m + 1 symbols, and link the Markov chain statistics to 
the probability for the DFA to make a transition from a state to another one. 

This remarkable fact has been realized only recently in the context of the analysis of biological 
sequences [2J; however, although leading to the relatively fast FMCI-based algorithm mentioned 
before, the approach presented in [2] uses DFAs just as auxiliary tools to produce the transition 
matrix between states of the underlying Markov process. 

In the present article we show how the statistical description of a sequence in terms of pure 
DFAs is worthwhile by itself, and may lead to much faster numerical evaluation schemes. To 
make the reader more at ease, throughout the following sections we will illustrate the formal 
developments of our technique by means of a worked example in the case of statistics of order 0. 

2.1 A case study 

Let us consider a stretch of genome with L = 10 in which the motif ATC appears two times. To 
decide if this frequency means that ATC is for some reason overrepresented, given some Markovian 
statistics we must compute the probability that such a pattern can appear randomly two times in 
a genome stretch of length 10, and compare it to the observed probability of the event. More in 
general, how to calculate the probability of ATC to appear n times? 




Figure 1: An infinite DFA able to recognize and count all occurrences of motif ATC. When the automaton 
is in states 0, 1 or 2 the pattern has occurred exactly times; when the automaton is in states 3, 4 or 5 
the pattern has occurred 1 times; and so on. State E is the end-of-input. 

Of course, one could try to give an answer in terms of direct enumeration. In fact, this approach 
works quite well for very short sequences; however, the reader can easily convince themselves that 
the method is both very expensive and very difficult to generalize algorithmically as the length L 
of the sequence increases, and it becomes completely unpractical for the analysis of genomes of 
realistic size (i.e., L ~ 10 7 -10 9 bases). 

To answer the question in more general terms, we begin by writing down the DFA of Figure [TJ 




Figure 2: The basic building block of Figure [T] The complete automaton is obtained by replicating this 
block as many times as the number of observed occurrences, plus one. 

Such a DFA is obtained by chaining three identical basic blocks as that represented in Figure [2J 
each one matching one occurrence of the original ATC motif. Starting from state 0, the automaton 
works by reading in one character at the time from the stretch of genome, and changing its internal 
state step by step according to the input; for example, the automaton of Figure [T] when processing 
the string CAATCGTCATCG will run through the states 0, 1, 1, 2, 3, 3, 3, 3, 4, 5, 6, 6. The meaning 
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of the states is thus as follows: when the automaton is in state 0, it is waiting to read the first 
A; in state 1, it is waiting for a T after having read one or more As; in state 3, exactly one ATC 
has been read, possibly preceded by any string different from ATC and followed by any number of 
bases different from A; in state 6. exactly two ATCs have been read, preceded and separated by 
any possible string different from ATC, and followed by any string which is not a substring of ATC; 
and likewise for any other state (the special character _Lmcaning the end of input, so that state 
E is the final one, reached after all the input string has been consumed by the automaton). We 
observe that the construction of the automaton is far from being trivial; however, as emphasized 
before standard algorithms to solve the problem are known since a long time and may easily be 
found in the literature [ITj. Chaining more than one basic block as in Figure [T] allows us to count 
the number of instances of ATC found in the genome: by construction, when processing any of the 
genomes which contain exactly two occurrences of pattern ATC the automaton will always end up 
in states 6, 7 or 8. 

So far, we have thus succeeded in restating our original problem: all the genomes we are 
interested in {e.g. those containing two instances of pattern ATC) are the genomes that make 
the automaton of Figure [1] terminate by reaching state E through states 6, 7 or 8. How to 
extract direct numerical information out of this statement? There are basically two answers to 
this question. 

The first answer is that by the method of generating functions one can work out analytically 
the probability for the automaton to be in state 6, 7 or 8 (and in all other states as well) after 
having read any arbitrary input; again, introductory examples to this computational procedure 
may be found in [T2] . This method is appealing since it allows the computation of the probability 
distribution function in closed form, but it suffers from an obvious important drawback: for a 
sequence of length L it requires a Taylor expansion of order L\ given that the typical size of a 
relatively short bacterial genome is already in the range of 10 6 -10 7 bases, one would expect this 
technique to be unpractical when analyzing realistic stretches of genome. Although this is indeed 
the case, it is worthwhile to note that the method has nonetheless been exploited successfully in 
the context of biological research for the case of shorter genomes [5] ; this result may serve as a 
useful reference. 

2.2 The systolic DFA 

We observe that a second approach is possible, which is simpler, easier to implement and more 
amenable to direct interpretation and fast numerical evaluation. Quite surprisingly, to the best of 
the authors' knowledge this powerful method does not seem to have been proposed before for any 
relevant practical application, neither in computer science nor in biology. 




Figure 3: The automaton of Figure \T\ reinterpreted as a systolic array. Transitions to the end-of-input 
state E have been omitted. 

To implement it, we turn the DFA of Figure [T] into a systolic array, which is a weighted 
graph where the propagation of flow from node to node happens proportionally to the transition 
probabilities, and at a constant speed of one edge per alphabet symbol read by the automaton 
in the sequence; such a construction owes its name to the fact that it can be assimilated to an 
hydraulic circuit where the probability 1 enters from the left at time 0, and is then split and 
pumped at constant speed towards the following states of the automata, reaching more and more 
nodes as time goes by. 
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For example, let us suppose to start at time — that is, at sequence length L = — with 
all the probability in state 0. Since state has two outgoing connections, one to state 1 and 
one to itself, at time 1 (or, equivalcntly, at length L — 1) the content of state will have been 
pumped partly into state 1 and partly back to state 0; in particular, state 1 will now contain 
Pi(t = 1) = po(t = 0) • Pa = 1 • Pa = Pa, and po(t = 1) will be the new content of state 0, that is 
Po(t — 0) • pj — pj. The idea can easily be generalized to later times {i.e. longer sequences) if the 
structure of the automaton is taken into due account. 

By construction, at any given time (or sequence length L) the probability distribution p(n) as a 
function of the number of pattern occurrences n may thus be obtained just by adding the contents 
of the states three by three: for example, if L = 10thenp(0) — po(t = 10)+p±(t = 10)+p 2 (t = 10), 
p(l) = P3(10) +P4(10) +Ps(10), and so on (the extension to automata with different number of 
states is obvious); this direct interpretation should not be overlooked, since it is fundamental for 
the developments to come. We note that the probability flows and redistributes from node to 
node as a consequence of time evolution but its total amount is not consumed in the process; in 
fact, the sum of the weights of the outgoing connections for each node is by definition 1, being it 
also the sum of the transition probabilities over all possible symbols. The automaton of Figure [T] 
re- interpreted as a systolic array is shown in Figure [3j and its evolution up to t = 512 (that is, up 
to L = 512) is listed in Table Q] 
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Table 1: Flow of probability propagating through the systolic array of Figure [3] as the array consumes 
longer and longer stretches of genome. To compute this table we chose pa = Pc = Pa = Pi = 0.25. Apart 
from the probability of recording zero patterns, which always decreases exponentially after a transient 
phase (see Section [2.3(1 . the other probabilities show a bell-shaped behaviour w.r.t. genome length: the 
probability of recording exactly iV ba patterns grows, peaks for a certain length of the genome stretch, 
and decreases afterwards. It should be noted that starting from L > 8 the probabilities add up to a 
number which is smaller and smaller than 1, since more and more probability flux is transmitted to states 
Pb,Pio, 

The interesting point about such a construction is that it is able to keep track of the effects 
of all possible input sequences at the same time by superposition, as a consequence of the fact 
that the probability present in a node at time t splits over all possible transitions at time t + 1. 
Basically, we have turned our original automaton, which was a simple recognizer, into a much 
more powerful device, which can now compute the probability of being in each state after having 
read all possible inputs; it should thus not come as a surprise the statement that we are now able 
to scan in polynomial time sets of strings whose cardinality grows exponentially with the length 
of the sequence. 

We note that even a naive interpretation of this setup immediately yields a viable computational 
scheme. 

Algorithm 1 (Systolic Naive). Given a Markov model of order m and a motif m, compute 
p(x) as follows: 

1. generate the systolic automaton associated to m by the given Markov model; in particular, 
set a suitable initial condition pi(t = 0) for the states (1 ^ i ^ s) 

2. for t = 1 to L do 

evaluate the systolic propagation in the automaton: 
for i = 1 to s do 
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Pi(t) = o 
done 

foreach connection from state Sg 0urco to state Sg ink with weight Wk (1 ^ fc ^ a x s) do 

done 
done 

(x+l)-s-l 

3. return the probability distribution function as p(x) = Pn{L) , Vx G [0, A^s*]- □ 

n—x-s 

This algorithm possesses at least two very desirable properties: 

1. apart from possible small roundoff errors in the evaluation of Markovian statistics, it is exact 
and numerically stable. 

2. it allows the optimized evaluation of p(x) in the region x € [0, AT^*]; in fact, this effect may 
be obtained just by truncating the automaton after (vV„bs + 1) basic blocks, and letting the 
probability flux which should go to higher stages simply disperse. 

Considering that each state of the automaton has a possible outgoing transitions, the computa- 
tional cost of the algorithm in the latter truncated case can readily be written as 0(a x s x L x (AT b s + 1)), 
which by a striking coincidence turns out to be exactly the same as that of the very different 
FMCl-based exact algorithm presented in [2]. We also point out that taking advantage of the 
cutoff technique which will be explained in Section [3J we could slightly improve on this result by 
discarding the tail of very small elements coming ahead of the distribution. 

2.3 Further developments 

In fact, much better results may be obtained; to proceed, however, we have to develop a slightly 
more elaborate description of the problem. 

First of all, we observe that it is not really necessary to have in the automaton as many 
building blocks as the largest number of pattern occurrences AT^* = L/s which might appear in 
a sequence; in fact, it is possible to get a "folded" version of the automaton just by adding one 
single connection to its basic block. For example, the reader may easily convince themselves that 
the automaton of Figure [5] is entirely equivalent to that of Figure [31 provided that we supply a 
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Figure 4: The folded version of the systolic array of Figure [3] The transition from state 2 to state is a 
counting one, as indicated by the "+" symbol. Transitions to the end-of-input state have been omitted. 

mechanism to keep track of the fact that the probability flux associated to a number A^ of motif 
occurrences in state 2 becomes associated to A^ + 1 occurrences when passing to state through 
the special connection indicated by the "+" symbol; that is why we name such a link a counting 
transition. We will describe in a moment one possible way to implement a bookkeeping mechanism 
like the one just mentioned. We also note that our treatment introduces a slight but significant 
difference w.r.t. the related approach presented in [13] : in fact, as from the standard theory of DFA 
construction one finds in the latter work a final counting state instead of our counting transition, 
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resulting in our automaton being one state shorter; although we will not examine further this issue 
in the present paper, it is possible that our slightly more compact construction could indeed be 
used to improve some of the results obtained in [13] . 

Secondly, it is worthwhile to cast the problem in matrix form. To this end, we observe that 
it is possible to express the systolic propagation of the probability flux in terms of a transition 
matrix A4, which describes the modification in the content of the states of the automaton as the 
computation proceeds from time t to time t + 1 (that is, from the length L of the sequence read 
so far by the automaton to length L + 1). This goal may be obtained quite straightforwardly just 
by arranging as a matrix the transition probabilities from one state to another. 

In particular, if there were not any mechanism in place to record the number of observed 
patterns, M. could be defined quite simply as an s x s matrix of the form 

Mij := p{i -> j) . 

For example, for the automaton in figure 4 one would write 

Pi Pc\g 

Mo = I p A Pa Pa I , (7) 




pr 

the columns representing (in order) the state which the automaton leaves on reception of a new 
symbol and the rows being indexed by the state reached by the automaton: at the intersection 
between any column and any row one would find the probability of such a transition to happen. 
Note that by definition all the columns would be normalized to 1. 

We have only elucidated half of the matrix structure so far, though: our finite automata has 
a special counting transition, which allows us to remember, in any state, how many times the 
motif has been observed. A convenient way to represent this feature is by replacing the content 
of each state of the automaton with a polynomial: the coefficient of order k of the polynomial 
will then represent the probability that the motif has been observed k times. Accordingly to this 
convention, the elements of the transition matrix need to be replaced by a polynomial as well; in 
particular, one has to multiply the probability of the counting transition(s) by x, since their effect 
is to increase by one the number of motifs met so far. 

For example, assuming all the probabilities to be Pa = Pc = Pa = Pt = 0.25 the complete 
counting transition matrix for the automaton of figure 4 will now be 




M(x) = 0.25 0.25 0.25 , (8) 



the only difference w.r.t. the non-counting matrix A^o of ([7]) lying in the transition probability 
from state 2 to state 0, which now has been replaced by Pgit + Pc%- 

To complete the formalism, one just needs to describe how the probability flux is distributed 
among the states of the automaton at the beginning of the computation; this can be done by simply 
letting the product of matrices operate on a real-valued vector vq of length s which expresses 
the initial condition. The probabilities of being in any particular state can thus be deduced by 
multiplying the matrix M. by itself L times and applying the result to vq\ in the end, the elements 
of the obtained vector must be summed over, since we are not interested any longer in how the 
probability is distributed among the states of the automaton. The resulting polynomial will then 
correspond to our probability distribution. This way, we get to the final formula, Equation ([3|). 

The reader may easily verify that a repeated application of the matrix appearing in Equation j8]) 
to the initial condition defined by 

/ 1 \ 



v = 





V o j 
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indeed reproduces Table [T] In particular, all the construction carried out in this Section can be 
directly extended to the cases of generic pattern and of Markov model of generic order by supplying 
the correct automaton with its initial condition to the matrix formalism just described (this fact is 
used in [13] as well); indeed, different choices for the model or m only modify the actual contents 
of M. and vq in Equation ([3]), which remains correct. 

This way of restating the original problem elicits very interesting considerations. For example: 

1. po always decreases exponentially after a transient 

2. more stringent conclusions may be drawn about the unimodality of p(x) 

3. the causality requirement to the propagation of probability flux imposes specific constraints 
to the form of the transition matrix; such property might possibly be used to speed up the 
computation of p(x) even more. 

In general, although in the present paper we will not pursue any formal investigation, we emphasize 
that Equation ([3]) indeed is a perfect place where a rigorous study of the statistical properties of 
Markovian probability distribution functions can be started. 

Finally, it is straightforward to obtain the naive costs of the different possible strategies to 
evaluate Equation (|3|) ; we recall that such results have already been presented in Section 11.21 

3 Algorithm 

Everything is now in place to describe our algorithm for the fast evaluation of probability distri- 
bution functions of Markov models. 

The key observation to obtain an algorithm with superior performance w.r.t. that presented in 
[2] is to note that multiplication of polynomials may be better performed in terms of a convolution 
of their coefficients. 

Algorithm 2 (Polynomial multiplication by Discrete Fourier Transformation). Given 
two polynomials p(x) andq(x), compute pq{x) as follows: 

1. extend p(x) and q(x) to two polynomials p'{x) and q'{x) of degree a" = deg(p) + deg(g) by 
zero padding. 

2. define and compute 3"(p<?) as ^{pq) i '■= ■5(jJ)i-&(q')i> ^ i ^ d' . 

3. return the polynomial product as pq(x) = $ (J5(pq))- ^ 

Of course, the main point of interest for our application is that the naive evaluation of pq(x) en- 
tails (deg(p) + 1) x (deg(g) + 1) operations, while the use of Algorithm [5] coupled to Fast Fourier 
Transform (FFT) would require only 0(degpq x log(degpq)) operations to perform the same com- 
putation, making the behaviour switch from quadratic to linear. We observe that in the literature 
it is not easy to find direct applications of this scheme, possibly due to its sensitivity to numerical 
noise which will be analyzed in more detail below; however, the algorithm is indeed applied in 
indirect form in some cases, for example when computing products of modulo polynomials (as in 
the so called Karatsuba multiplication [33]). 

Let us note that if we consider polynomial-valued matrices the following relation holds: 

where by definition 

{d(M)).. ~$(M i:i (x)); 

this means that the Fourier operator applied to polynomial coefficients commutes with matrix 
multiplication. 



10 



A natural and appealing idea is then to use an FFT setup to directly compute the power A4 L of 
our polynomial- valued matrix appearing in ([3]). Nonetheless, two objections should be addressed. 

First of all, as from Algorithm [5] a (technical) problem when using the FFT algorithm in 
the context of polynomial multiplication is that care must be taken to ensure that FFT vectors 
are larger than the degree of the resulting polynomial, otherwise overlaps will occur. This issue 
increases the memory requirement of the algorithm by a factor of two, but otherwise it has no 
relevant effect on the evaluation strategy for p(x). 

A much more delicate issue is that the FFT algorithm is very sensitive to the noise introduced 
by rounding errors; in particular, if our bell-shaped distribution is to be represented as a (long) 
polynomial obtained by repeated multiplications of shorter ones, the typical condition number of 
the coefficients will be large, while the FFT algorithm can be faithful only to the coefficients whose 
magnitude does not differ from that of the largest coefficient for more than the numerical precision 
of the floating-point type used; this shall have the effect of making unreliable the computation of 
the smaller coefficients of the polynomial — that is, the computation of the tails of the distribution 
and hence of the p- value. However, it is easy to answer this objection, at least when the proposed 
algorithm is used to compute biologically relevant quantities. In detail: 

1. occurrences of motifs observed in practical situations only very rarely fall in the far tails of 
the distribution (and in any case, we can always quantify algorithmically when our computed 
p- value becomes unreliable). 

2. even if the computation of the tails becomes unreliable, the FFT algorithm is still perfectly 
able to deduce all the relevant statistical quantities of the distibution — which only depend 
on the bulk of the distribution, not on the exact knowledge of its tails — and hence the 
z-value, which is probably even more significant and informative than the p- value in the case 
of a very unlikely occurrence number. 

We can now formulate without concerns an effective strategy to evaluate A4 L in Equation ([3|), 
and thus the bulk of p{x), by an FFT-based technique. 

Algorithm 3 (Systolic Fast via FFT). Given a Markov model of order m and a motif m, 
compute p{x) as follows: 

1. generate the systolic automaton associated to m by the given model. 

2. deduce from the systolic automaton the transition matrix M. and the initial condition v$ . 

3. create the two auxiliary square matrices power and result of size s. 
4- initialize: powers- result <— ^(idcntity s ), N <— L. 

5. compute S"(A4 ) by binary decomposition in power as: 
while true do 

as necessary, apply a cutoff function to the elements of power and result: 
powers- cutoff {power), result <— cutoff (result); 
if N mod 2 = 1 then 

result <— result ■ power; 
N^[N/2\; 
if N = then 

break; 
power <— power ■ power 
done. 

6. return the probability distribution function as Tr(5' _1 ( result) ■ vq) . □ 
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We will explain soon how the cutoff function appearing at point (5) should be chosen; for the 
moment, let us pretend that it is the identity. In this case, recalling that = L/s by 

construction of the (systolic) automaton, the cost of the algorithm is given by 



0(3 x s 2 x (A^ x + 1) x log(7V™ x + 1) + 2 x s 3 x {N™ x + 1) x logL) = 0(s 2 x L x logL) , 



which is Equation ((4]); the first term in the l.h.s. comes from the Fourier transformations of A4, 
identity s and result, while the second one — which is dominant if no cutoff is applied — is due to 
the 2 x logL matrix multiplications occurring in the algorithm. 

It is worthwhile to note that, although already excellent from the point of view of performance 
(in particular w.r.t. the complexity of the FMCI algorithm [5] in the quadratic short-pattern 
regime, as already pointed out in Section f 1 - 2[) this scheme can be further improved by an appro- 
priate choice of the function cutoff which appears in Algorithm [3] 

In fact, we have already pointed out before that our knowledge of p(x) as obtained from an 
FFT-based algorithm is naturally limited by the numerical precision e := 10~ d of our floating- 
point type, d being the number of available decimal digits; as anticipated in Section 11.21 it is 
then natural to introduce a truncated distribution function p e (x), which is obtained from p(x) 
by removing its tails both on the left and the right extrema, in the region (if any) where their 
value falls below the threshold given by emaxp. As argued before, pj^x) keeps all the information 
which are needed to compute the statistical indicators we are interested in, and furthermore the 
truncation process does not introduce numerical instabilities; however, this choice of the cutoff 
has two contrasting implications on the complexity of the algorithm. 

The first one is that the cutoffmg step slows down the computation; in fact, both the matrices 
power and result appearing in Algorithm [3] store their elements in Fourier space, and as a result 
one needs to Fourier transform back and forth at each application of the cutoff. On the other 
hand, the second consequence is that after having applied the cutoff we need less polynomial 
coefficients than A" b | x + 1 = L/s + 1 to represent each element of M(x), since the tails where the 
distribution is small have now been eliminated; this fact turns out to be a relevant advantage in 
terms of computational efficiency in many cases, winning in particular a factor which is typically 
very big in the large-L regime, and thus justifying this choice of the cutoff. 

Defining base 6 as in Equation (|5|), we can then compute the final complexity of this new 
algorithm as 



from which Equation ([6]) immediately follows; depending on s the dominant term is usually the 
first one (which comes from the 4 x logL FFTs taking place during the evaluation of the cutoff 
function) , but in some practical cases it could be as well the second one (which comes from matrix 
multiplications). We note that this result is tipically — that is, for moderate values of s w.r.t. 
L — much better than that of Equation ([5]); so, the formula just obtained justifies the claim of 
Section 11.21 Another virtue of such a choice for the cutoff function is that it lowers the memory 
requirements of the algorithm, bringing them from 0(2 x s 2 x L/s) = 0(s x L) to the usually 
much smaller 



A remarkable feature of this computational scheme is that it is able to automatically lock itself 
onto the relevant part of the distribution function, leaving the tails off; no parameters need to be 
specified except for the cutoff, which in turn can be automatically determined with some simple 
heuristics from the numerical precision of the floating-point type used. 

Finally, we emphasize that cutting off the support is the only natural choice for an FFT 
algorithm, given again what has been described above about the sensitivity of the FFT to numerical 
noise; in particular, it is not possible to evaluate p(x) partially, truncating it at -/V b s as it is done 
in [5] or in Section |2"T21 On the other hand, the final cost ([5]) is to be understood as the cost to 
obtain the whole bulk of p(x), while the much higher cost dU of [2] is the cost of obtaining only 
the part of p{x) ranging from to N Q \> & . 



0(s 2 x base e (p) x logbase e (p) x (3 + 4 x logL) + 2 x s 3 x base e (p) x logL) 
O ( s 2 x base e (p) x log Lx (2 x s + 4 x log base e (p) ) ] 




(9) 
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4 Implementation 



All the considerations carried out in the last Sections have been gathered and implemented in a 
computer program called PATRONUS (from "PATtern Recognition by Optimized Numerical Universal 
Scoring"). The program is mostly written in Objective Caml [T5], a very- high-level functional 
programming language, with some C insets: the part of the code which computes the power M. L 
of the transition operator for the systolic DFA through Algorithm [3] (see Section [3|) is critical for 
the overall performance of the program since most of the total execution time is spent there; thus, 
the relative code has been optimized at low level using C, and packaged as an OCaml primitive. 
This architecture allows to get the best from both worlds (the superior formal power of OCaml, 
and the numerical efficiency of C) at the expense of only some minor performance penalty 

For the FFT-related code we used FFTW [16], a C library which is both very portable and 
carefully optimized, and offers sophisticated routines which transparently take advantage of vector 
SIMD extended instructions on the processors where they are present. 

The program reads in the sequence in FASTA format, and accepts many options. As for 
its general architecture, it behaves as a series of cascaded filters, the action of each stage being 
optional: 

1. the first stage produces a Markov model out of a given sequence and optionally writes it to 
a specified file, or reads an already existing model from a precomputed file. 

2. the second stage scans the sequence for a set of motifs described in terms of a regular 
expression or a IUPAC template, if the user specifies one on the commandline; it then 
optionally writes the set of motifs, together with the recorded occurrence numbers, to a 
specified file. Alternatively, the set may be read from a precomputed file. 

3. the third stage obtains the probability function pj^x) by running Algorithm [3] for all the 
couples (m, , iV bs(tni)) which have been found during the previous stage. 

As a result of this architecture, even complex examples like those presented in the next Section 
were produced with a few compact one- line invocations. More in detail, some of the offered features 
are particularly worth noting: 

1. many variations on the main numerical engine are supplied; it is possible to choose be- 
tween different floating-point precisions (single and double) and different memory-allocation 
schemes (slightly faster but more memory-hungry vs. slower but less memory-consuming). 

2. as in many modern similar programs — for instance in spatt |17j — , the implementation of 
the algorithm is completely independent of the symbols actually appearing in the alphabet A 
of the sequence of interest; this means that PATRONUS may be used without any modification 
to analyze DNA, proteins, or arbitrary strings as well. 

3. as mentioned before, arbitrary regular expressions may be specified as the set of motifs to be 
scanned for in the sequence; in addition, the standard IUPAC pattern encoding is accepted 
by the program. For instance, both the strings " : iupac_dna : NNNN" and are valid 
specifiers for an arbitrary sequence of 4 nucleotides. However, we would like to emphasize 
that for what regards IUPAC patterns and complex patterns in general we have adopted 
an approach different from that chosen in similar frameworks (contrasting for example with 
the one of |13|). since we think it more appropriate to biological applications: instead of 
producing a big automaton which matches the sometimes astronomical number of all possible 
motifs specified by a IUPAC pattern template — with some of them occurring in our sequence, 
but most of them never doing so — , we rather prefer to explicitely locate and evaluate only 
the motifs which effectively do appear in the sequence being studied. 

Constantly, and during the initial development phase in particular, a lot of care has been spent 
in checking the results against possible numerical errors by a variety of tests. First of all, the 
stability of the numerical engine has been studied by repeating the computations with different 
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Table 2: Comparative timings for various biological examples as obtained from (a) the exact FMCI method 
of [2] implemented in program spatt [T7| (b) PATRONUS, the exact method introduced in this paper (c) 
the FMCI large-deviations approximation, as given by the command ldspatt (d) the FMCI Gaussian 
approximation, provided by the command spatt — gaussian (in the column labeled gpatt). The timings 
were computed on an Intel Core Duo T2500 processor at 2 GHz with 2 GBytes of RAM excluding all 
input/output-related operations (like building Markov models and counting pattern occurrences). The 
gaps present in the current reference assembly of chromosome X have been discarded for these runs. A 
timing of oo means that the corresponding test did not terminate within 10 minutes; in case of orange 
background PATRONUS was faster than the large-deviations approximation, in case of yellow background 
PATRONUS showed a better performance also w.r.t. the Gaussian approximation. The tested versions are 
2.0-prel for spatt and 1.2.2 for ldspatt; as for PATRONUS, all the examples were run in double precision 
using build 68 with the default cutoff e = 10~ 14 . The timings for the case m = are very similar to those 
for m = 1, and thus they have been omitted. 
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numerical precisions; afterwards, the correctness of the obtained results has been challenged by 
comparison with brute-force enumerations tests for various motifs in strings of length L ^ 16; 
finally, the output of the program has been directly compared either to the solution given by the 
spatt program |17j for the regimes where the exact FMCI method would terminate in a reasonable 
amount of time, or to the large-deviations and Gaussian FMCI approximations — computed rcsp. 
by ldspatt and by spatt — gaussian — in the cases where the exact FMCI method was too 
slow. No significant discrepancy has ever been noticed during all the tests and examples which 
have been run. 

The program is free for academic and non-commercial use, and may be obtained from the 
corresponding author. Eventually, it will also be possible to retrieve it online from the URL [T8] . 

5 Application examples 

The interesting new possibilities allowed by a powerful tool like PATRONUS are so many that it 
is very difficult to illustrate them with just a few examples; however, after some reflection two 
situations have been identified and selected as typical for a large category of users. 

Exploiting the very good performances of PATRONUS we have addressed both the analysis of 
the human X chromosome (see Figure [5] and Table [5]), and of a set of more than 16.000 yeast 

1.2e-05 i 1 1 1 1 1 1 1 1 1 1 1 1 




3700000 3900000 4100000 4300000 

N ohs 

Figure 5: The exact distribution as computed by PATRONUS (black line) of the occurrences of the motif AAT 
in the human chromosome X for a Markov model of order m = 1, considering an alphabet of five letters 
"ACGNT" to keep into account the gaps still present in current reference assembly. The obtained values 
for the statistical indicators are: mean = 4153710, standard deviation = 36587.1, skewness = —0.635136, 
kurtosis = 18.5967. As expected in this regime, the FMCI Gaussian approximation (orange line) is very 
good, providing a value of 4153790 for the mean and of 36676.9 for the average, and thus a substantially 
correct z-value; however, the distribution and its Gaussian approximation are actually very different, as 
clearly shown by the inset in logarithmic scale. In turn, if the gaps in the assembly are discarded and the 
standard alphabet is employed, the distribution looks almost perfectly Gaussian (not shown). 

transcription-factor binding sites (see Figures [5] and [7]) . Neither problem can be practically ap- 
proached with spatt (see Table [2] where timings on the smaller genomes of HIV and B.Subtilis 
are shown as well); furthermore, nothing would stop PATRONUS from examining the whole human 
genome (~ 3 • 10 9 bases) in a feasible time. For example, assessing the relevance of all 320 3- and 
4-letter patterns in the X chromosome at m = 2 took only about 5 hours on our test machine. 
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Figure 6: The exact distributions as computed by PATRONUS in S.cerevisiae for the TATA-box core 
sequence, TATAAA, considering all Markov models from order to 3 (black lines). The FMCI Gaussian 
approximations are shown in green. Since N ba = 8635, the pattern appears to be underrepresented in all 
the models apart from the case m = 1. 

A first general conclusion may be drawn from the many tests we performed: the method seems 
to work in a very fast and extremely reliable way for a broad interval of the parameter range. 
The only limitation is that the order m of the Markov model employed should be 3; in fact, for 
m ^ 4 the algorithm becomes unpractical, due to large computational times and, more crucially, 
to excessive memory requirements. This fact may be readily understood from Equations ([6]) and 
(|9|) which express the computational and memory costs, since by construction the number s of 
states in the automaton is 

s = a m +1-1 -m, 

where I is the length of the examined motif. On the other hand, in the range ^ m ^ 2 the 
typical performance of the method is so good that it usually even consistently outperforms both 
the large-deviations and the Gaussian FMCI approximation of [5] , as shown by Table [H This 
is was matters practically, though, because using a Markov model with m ^ 3 typically implies 
severe uncertainty problems on the model itself [3J. 

The second main point is that our method always produces exact probability distribution 
functions from which all the information may be extracted; this can in principle allow for new 
interesting theoretical insights (see Figure [5]). 

Finally, our algorithm constitutes a very efficient benchmark for every possible approximate 
solution to the same problem (see Figure [7|) . 

6 Conclusions 

In this article, we show for the first time that a fast and accurate numerical evaluation of exact 
Markovian probability distribution functions in realistic cases of biological interest is possible. 
This is more and more important to get a reliable quantitative assessment of the relevance of 
biological sequences on the basis of their over- or under-representation, since the fast approximate 
methods used so far are known to systematically produce incorrect results. In fact, our algorithm 
retains the full ability of deducing all the relevant statistical information about motif occurrences, 
but its speed is comparable to that of some approximate algorithms, or even better in many cases: 
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Exact z-values (PATRONUS) 

Figure 7: Graph of the z-values at m = 1 of ~ 3000 transcription-factor binding sites in S.cerevisiae 
as computed by the Gaussian FMCI approximations vs. the exact results computed by PATRONUS. The 
transcription-factor binding sites have been extracted from 279 experimentally verified IUPAC templates. 
The graphs clearly show the regions where the approximation breaks down (that of large z-values) . Note 
that for these examples PATRONUS is usually faster than the approximation (cf. Table [2J. 

indeed, our successful analysis of motifs on the length scale of the human genome seems to prove 
that the exact Markovian approach should from now on be considered viable even on today's 
computers. Thus, we hope that this result will open the way to a more widespread use of exact 
methods in the analysis of biological sequences. 
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